The objective of the present work is to develop a smart keyboard that helps people be more effective on their mobile devices. A predictive text model has been developed that offers the user of a mobile device three suggestions for the next word.
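To make the idea of "three options for the next word" concrete, here is a minimal sketch. It assumes a simple bigram frequency model, which is an illustrative choice and not necessarily the model proposed in this work. The function names `train_bigrams` and `predict_next` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(documents):
    """Count, for every word, how often each other word follows it.

    Assumption: whitespace tokenisation and lower-casing only; a real
    pipeline would clean the text more carefully.
    """
    followers = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, text, k=3):
    """Return up to k candidate next words, most frequent first."""
    tokens = text.lower().split()
    if not tokens:
        return []
    candidates = followers.get(tokens[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]


# Toy corpus standing in for the blogs, news and tweets documents.
toy_corpus = [
    "i love new york",
    "new york is big",
    "new york city",
    "new shoes",
    "my new shoes",
    "new ideas",
]
model = train_bigrams(toy_corpus)
suggestions = predict_next(model, "I love new")
print(suggestions)
```

Looking only at the last typed word keeps the sketch short. Longer contexts (trigrams and beyond) and a fallback for unseen words would be natural extensions.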
To develop such a model, a large corpus of text documents has been created by merging three different types of English sources: blogs, news and tweets. The raw data, the Capstone Data Set, was provided by Johns Hopkins University, and all the code used to create this report and the proposed model is available on GitHub.
The research question is centered on: “How can an efficient predictive text model be developed on the basis of publicly available data such as blogs, news wires and tweets?” This implies that the methodology developed in this work can be replicated in any language, if needed.
The data is composed of more than four million documents, the exact total being 4’269’678. The following table presents summary statistics for the three file sources. The results highlight that some blog documents are very long compared with the median length of all three document types.
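Per-source statistics of this kind can be computed along the following lines. This is a sketch only: it assumes each source has already been loaded as a list of strings, one per document, and that words are separated by whitespace. The helper names `word_counts` and `summarize` and the sample documents are hypothetical.

```python
from statistics import mean, median


def word_counts(documents):
    """Number of whitespace-separated words in each document."""
    return [len(doc.split()) for doc in documents]


def summarize(documents):
    """Basic descriptive statistics of document length, in words."""
    counts = word_counts(documents)
    return {
        "documents": len(counts),
        "words": sum(counts),
        "min": min(counts),
        "median": median(counts),
        "mean": mean(counts),
        "max": max(counts),
    }


# Hypothetical stand-in for one source; real data would be read from file.
sample_docs = ["a b c", "a", "a b c d e f g h"]
stats = summarize(sample_docs)
print(stats)
```

Comparing `max` with `median` for each source is a quick way to spot unusually long documents such as the long blog entries mentioned above.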
A preliminary investigation was conducted to understand the properties and patterns of the data and to suggest modelling strategies. The following histograms show the distribution of the number of words per document for each file source.
The distribution of blog documents is positively skewed (right-skewed), reflecting the fact that a few blogs contain a very large number of words. Its shape is also suggestive of a Poisson distribution. The distribution of news documents appears to be slightly bimodal. Both contrast sharply with the distribution of tweets, which are much shorter in terms of number of words per document.
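The visual impressions above can be checked numerically. The sketch below computes the moment coefficient of skewness (positive for a right-skewed distribution) and the variance-to-mean ratio, which equals 1 for a Poisson distribution, so values well above 1 indicate overdispersion. The function names and the sample word counts are hypothetical and were not taken from the data set.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def dispersion_index(values):
    """Variance-to-mean ratio; equal to 1 for a Poisson distribution."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((v - mu) ** 2 for v in values) / n
    return variance / mu


# Hypothetical word counts per document: mostly short, one very long.
counts = [1, 1, 1, 2, 2, 3, 10]
skew = moment_skewness(counts)
ratio = dispersion_index(counts)
print(skew, ratio)
```

A clearly positive skewness would support the right-skew reading of the histogram, while a dispersion index far from 1 would argue against a strict Poisson interpretation.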
The following violin plots show the full distribution of word counts for each source, with the probability density displayed across the range of values. The median, interquartile ranges and other statistics can be consulted by hovering over the plots.